22_reinforcement_learning_from_q_value_till_the_end #24

Open · msrazavi wants to merge 40 commits into master

Conversation


@msrazavi msrazavi commented Jan 7, 2022

Q-learning is an **off-policy** learner, which means it learns the value of the optimal policy independently of the actions the agent actually takes. In other words, it eventually converges to the optimal policy even if you are acting sub-optimally.
Q-learning is a **sample-based** Q-value iteration method in which you learn the $Q(s,a)$ values as you go:

- Receive a sample $(s_t, a_t, r_t, s_{t+1})$ and fold it into a running average of $Q(s_t, a_t)$, as in the update below.

Please fix the LaTeX syntax.


$$Q(s,a) \leftarrow (1 - \alpha)\,Q(s,a) + \alpha \cdot \text{sample}$$

$$\rightarrow Q^{new}(s_t,a_t) \leftarrow \underbrace{Q(s_t,a_t)}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot \overbrace{\Big(\underbrace{\underbrace{r_t}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a} Q(s_{t+1},a)}_{\text{estimate of optimal future value}}}_{\text{new value (temporal difference target)}} - \underbrace{Q(s_t,a_t)}_{\text{old value}}\Big)}^{\text{temporal difference}}$$

Fix syntax
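
A minimal sketch of the tabular update above, assuming a small discrete environment. The Q-table layout, the hyperparameter values, and the function name `q_update` are illustrative choices, not part of the original text:

```python
from collections import defaultdict

# Q-table: Q[state][action]; unseen pairs default to 0.0.
Q = defaultdict(lambda: defaultdict(float))

ALPHA = 0.1  # learning rate (alpha), illustrative value
GAMMA = 0.9  # discount factor (gamma), illustrative value

def q_update(s_t, a_t, r_t, s_next, actions):
    """Apply one temporal-difference update from a single sample (s_t, a_t, r_t, s_{t+1})."""
    best_next = max(Q[s_next][a] for a in actions)   # estimate of optimal future value
    td_target = r_t + GAMMA * best_next              # new value (temporal difference target)
    td_error = td_target - Q[s_t][a_t]               # temporal difference
    Q[s_t][a_t] += ALPHA * td_error                  # Q(s,a) <- Q(s,a) + alpha * TD error
```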


<div id='epsilon-greedy-strategy'><h2> Epsilon greedy strategy </h2></div>

The tradeoff between exploration and exploitation is fundamental. The simplest way to force exploration is the **epsilon-greedy strategy**: with a small probability $\epsilon$ the agent takes a random action (exploration), and with probability $1 - \epsilon$ it takes the action prescribed by the current policy (exploitation).

fix syntax
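
A sketch of epsilon-greedy action selection over the Q-table from the previous snippet; `EPSILON` and the `actions` list are assumptions made for illustration:

```python
import random

EPSILON = 0.1  # small exploration probability, illustrative value

def select_action(state, actions):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit the current policy."""
    if random.random() < EPSILON:
        return random.choice(actions)                # exploration: random action
    return max(actions, key=lambda a: Q[state][a])   # exploitation: argmax_a Q(state, a)
```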


<div id='exploration-functions'><h2> Exploration functions </h2></div>

Another solution is to use **exploration functions**. Such a function takes a value estimate $u$ and a visit count $n$ and returns an optimistic utility, e.g. $f(u,n) = u + \frac{k}{n}$. We keep track of how many times each action has been tried: an action that has only been visited a few times receives a large optimism bonus and is tried more often, and because the bonus shrinks as the count grows, an action that keeps producing poor results eventually stops being explored.

fix syntax
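
A sketch of an exploration function in the spirit of $f(u,n) = u + \frac{k}{n}$, reusing the Q-table from the earlier snippet; the constant `K` and the visit-count table `N` are illustrative assumptions:

```python
from collections import defaultdict

K = 1.0                                    # optimism constant k, illustrative value
N = defaultdict(lambda: defaultdict(int))  # visit counts N(s, a)

def exploration_value(state, action):
    """Optimistic utility f(u, n) = u + k / n, with u = Q(state, action) and n = N(state, action)."""
    n = N[state][action]
    if n == 0:
        return float("inf")                # never-tried actions look maximally attractive
    return Q[state][action] + K / n

def select_action_optimistic(state, actions):
    """Pick the action with the highest optimistic utility and record the visit."""
    best = max(actions, key=lambda a: exploration_value(state, a))
    N[state][best] += 1
    return best
```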


$$V(s) = \omega_1 f_1(s) + \omega_2 f_2(s) + \dots + \omega_n f_n(s)$$

$$Q(s,a) = \omega_1 f_1(s,a) + \omega_2 f_2(s,a) + \dots + \omega_n f_n(s,a)$$

fix syntax
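
A sketch of the linear, feature-based representation above, where $Q(s,a)$ is a weighted sum of features $f_i(s,a)$. The feature extractor is a hypothetical placeholder, and the weight update shown is the standard approximate Q-learning rule, which is not spelled out in the excerpt above:

```python
from collections import defaultdict

weights = defaultdict(float)  # one weight omega_i per feature name

def featurize(state, action):
    """Hypothetical feature extractor returning {feature name: f_i(s, a)}; replace with task-specific features."""
    return {"bias": 1.0}

def q_value(state, action):
    """Q(s, a) = omega_1 * f_1(s, a) + ... + omega_n * f_n(s, a)."""
    return sum(weights[name] * value for name, value in featurize(state, action).items())

def update_weights(s_t, a_t, r_t, s_next, actions, alpha=0.1, gamma=0.9):
    """omega_i <- omega_i + alpha * difference * f_i(s_t, a_t),
    where difference = (r_t + gamma * max_a Q(s_{t+1}, a)) - Q(s_t, a_t)."""
    best_next = max(q_value(s_next, a) for a in actions)
    difference = (r_t + gamma * best_next) - q_value(s_t, a_t)
    for name, value in featurize(s_t, a_t).items():
        weights[name] += alpha * difference * value
```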


Q-learning is a basic form of reinforcement learning which uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.

Q-values are defined for state-action pairs: $Q(s, a)$ is an estimate of how good it is to take action $a$ in state $s$. This estimate is computed iteratively using the temporal difference update.

fix syntax
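
A sketch of how the pieces above fit into a training loop. The environment object `env` is an assumption: it is taken to expose a minimal Gym-like interface where `reset()` returns a state and `step(action)` returns `(next_state, reward, done)`:

```python
def run_episode(env, actions):
    """Run one episode, improving the Q-table with a TD update after every step."""
    s = env.reset()
    done = False
    while not done:
        a = select_action(s, actions)        # epsilon-greedy over the current Q-table
        s_next, r, done = env.step(a)        # assumed environment interface
        q_update(s, a, r, s_next, actions)   # temporal difference update
        s = s_next
```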

<div id='reinforcement-learning-from-q-value-till-the-end'><h1> Reinforcement Learning (from Q Value till the end) </h1></div>

General note: please fix the LaTeX syntax of the formulas.
